12 research outputs found

    Multilingual Language Processing From Bytes

    We describe an LSTM-based model which we call Byte-to-Span (BTS) that reads text as bytes and outputs span annotations of the form [start, length, label] where start positions, lengths, and labels are separate entries in our vocabulary. Because we operate directly on Unicode bytes rather than language-specific words or characters, we can analyze text in many languages with a single model. Due to the small vocabulary size, these multilingual models are very compact, but produce results similar to or better than the state-of-the-art in Part-of-Speech tagging and Named Entity Recognition that use only the provided training datasets (no external data sources). Our models are learning "from scratch" in that they do not rely on any elements of the standard pipeline in Natural Language Processing (including tokenization), and thus can run in standalone fashion on raw text.
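A minimal sketch of the [start, length, label] span representation described in this abstract, measured in bytes rather than characters. The abstract does not name the byte encoding, so UTF-8 is an assumption here, and `byte_span` is a hypothetical helper for illustration only; it shows the annotation format, not the LSTM model itself.

```python
def byte_span(text: str, mention: str, label: str) -> tuple[int, int, str]:
    """Return a (start, length, label) span measured in UTF-8 bytes (assumed encoding)."""
    data = text.encode("utf-8")
    target = mention.encode("utf-8")
    start = data.find(target)  # byte offset of the first occurrence
    if start < 0:
        raise ValueError("mention not found in text")
    return start, len(target), label


text = "Zoë lives in München"
span = byte_span(text, "München", "LOC")
start, length, label = span

# Slicing the byte sequence with the span recovers the mention.
recovered = text.encode("utf-8")[start:start + length].decode("utf-8")
print(span, recovered)
```

Because "ë" and "ü" each occupy two bytes in UTF-8, the byte offset of the mention differs from its character offset, which is why the span positions are defined over bytes.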

    Metal Fluorides as Analogs for Studies on Phosphoryl Transfer Enzymes

    The 1994 structure of a transition state analog with AlF₄⁻ and GDP complexed to G1, a small G protein, heralded a new field of research into structure and mechanism of enzymes that manipulate transfer of the phosphoryl (PO₃⁻) group. The list of enzyme structures that embrace metal fluorides, MFₓ, as ligands that imitate either the phosphoryl group or a phosphate, is now growing at over 80 per triennium. They fall into three distinct geometrical classes: (i) Tetrahedral complexes, based on BeF₃⁻, mimic ground state phosphates; (ii) Octahedral complexes, primarily based on AlF₄⁻, mimic "in-line" anionic transition state for phosphoryl transfer; and (iii) Trigonal bipyramidal complexes, represented by MgF₃⁻ and putative AlF₃⁰ moieties, additionally mimic the tbp geometry of the transition state. The interpretation of these structures provides a deeper mechanistic understanding of the behavior and manipulation of phosphate monoesters in molecular biology. This review provides a comprehensive overview of these structures, their uses, and their computational development. It questions the identification of AlF₃⁰ and MgF₄²⁻ as tbp species in protein complexes and discusses the relevance of physical organic chemistry and water-based model studies for understanding phosphoryl group transfer in enzymes. It describes two roles for amino acid side-chains that mediate proton transfers during phosphoryl transfer, based on the analysis of protein/MFₓ structures. First, they deploy hydrogen bonding to neutral oxygen nucleophiles so as to orientate them for correct orbital overlap with the electrophilic phosphorus center. Secondly, they behave as classical general acid/base catalysts.

    D2.2.5 MineSet™

    MineSet™ is a commercial data mining product from Silicon Graphics. It provides an interactive platform for data mining, integrating three powerful technologies: database and file access, analytical data mining engines, and data visualization. MineSet supports the knowledge discovery process from data access and preparation through iterative analysis and visualization to deployment. MineSet uses a client-server architecture for scalability and support of large data. The data access component provides a rich set of transformations that can be used to process stored data into forms appropriate for visualization and analytical mining. MineSet’s 2D and 3D visualization capabilities allow direct data visualization for exploratory analysis. The analytical mining algorithms create models that can be viewed using visualization tools specialized for the learned models or deployed as part of a larger system. Third party vendors can interface to the MineSet tools for model deployment and for integration with other packages.

    Pruning Decision Trees with Misclassification Costs

    decision tree classifiers in two learning situations: minimizing loss and probability estimation. In addition to the two most common methods for error minimization, CART's cost-complexity pruning and C4.5's error-based pruning, we study the extension of cost-complexity pruning to loss and two pruning variants based on Laplace corrections. We perform an empirical comparison of these methods and evaluate them with respect to the following three criteria: loss, mean-squared-error (MSE), and log-loss. We provide a bias-variance decomposition of the MSE to show how pruning affects the bias and variance. We found that applying the Laplace correction to estimate the probability distributions at the leaves was beneficial to all pruning methods, both for loss minimization and for estimating probabilities. Unlike in error minimization, and somewhat surprisingly, performing no pruning led to results that were on par with other methods in terms of the evaluation criteria. The main advantage of pruning was in the reduction of the decision tree size, sometimes by a factor of 10. While no method dominated others on all datasets, even for the same domain, different pruning mechanisms are better for different loss matrices. We show this last result using Receiver Operating Characteristics (ROC) curves.
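A small sketch of how a Laplace correction at a leaf can change a loss-minimizing prediction. The abstract does not state the formula, so the common form (n_c + 1) / (N + k) for k classes is assumed; the function names, leaf counts, and loss matrix are hypothetical illustrations, not values from the paper.

```python
def raw_estimate(class_counts: list[int]) -> list[float]:
    """Plain relative frequencies of the classes at a leaf."""
    total = sum(class_counts)
    return [c / total for c in class_counts]


def laplace_estimate(class_counts: list[int]) -> list[float]:
    """Laplace-corrected frequencies: (n_c + 1) / (N + k), an assumed standard form."""
    total = sum(class_counts)
    k = len(class_counts)
    return [(c + 1) / (total + k) for c in class_counts]


def min_loss_class(probs: list[float], loss: list[list[float]]) -> int:
    """Pick the prediction with the lowest expected loss.

    loss[i][j] is the cost of predicting class j when the true class is i.
    """
    n = len(probs)
    expected = [sum(probs[i] * loss[i][j] for i in range(n)) for j in range(n)]
    return min(range(n), key=lambda j: expected[j])


counts = [3, 0]              # hypothetical leaf: three class-0 examples, no class-1
loss = [[0, 1], [10, 0]]     # hypothetical: missing class 1 costs ten times more

raw_choice = min_loss_class(raw_estimate(counts), loss)
laplace_choice = min_loss_class(laplace_estimate(counts), loss)
print(raw_choice, laplace_choice)
```

With raw frequencies the leaf assigns class 1 a probability of zero and always predicts class 0; the corrected estimate keeps some probability on class 1, so under an asymmetric loss matrix the cheaper prediction can flip.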